32 research outputs found
CERN: Confidence-Energy Recurrent Network for Group Activity Recognition
This work is about recognizing human activities occurring in videos at
distinct semantic levels, including individual actions, interactions, and group
activities. The recognition is realized using a two-level hierarchy of Long
Short-Term Memory (LSTM) networks, forming a feed-forward deep architecture,
which can be trained end-to-end. In comparison with existing LSTM
architectures, we make two key contributions, which give our approach its
name: Confidence-Energy Recurrent Network (CERN). First, instead of using the common
softmax layer for prediction, we specify a novel energy layer (EL) for
estimating the energy of our predictions. Second, rather than finding the
common minimum-energy class assignment, which may be numerically unstable under
uncertainty, we specify that the EL additionally computes the p-values of the
solutions, and in this way estimates the most confident energy minimum. The
evaluation on the Collective Activity and Volleyball datasets demonstrates: (i)
advantages of our two contributions relative to the common softmax and
energy-minimization formulations, and (ii) superior performance relative to
state-of-the-art approaches.
Comment: Accepted to IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017
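To make the two-level LSTM plus energy-layer idea concrete, below is a minimal, hypothetical PyTorch sketch. The module names, feature dimensions, person pooling, and the Gibbs-style confidence proxy are illustrative assumptions, not the paper's exact energy layer or its p-value computation.

```python
import torch
import torch.nn as nn

class EnergyLayer(nn.Module):
    """Toy energy layer: scores each candidate class with an energy instead of
    a softmax probability (hypothetical simplification of the paper's EL)."""
    def __init__(self, hidden_dim, num_classes):
        super().__init__()
        self.energy_head = nn.Linear(hidden_dim, num_classes)

    def forward(self, h):
        energies = self.energy_head(h)          # (batch, num_classes); lower = better
        # Gibbs distribution over energies, used here as a simple confidence
        # estimate attached to the minimum-energy class assignment.
        confidence = torch.softmax(-energies, dim=-1)
        prediction = energies.argmin(dim=-1)    # minimum-energy class
        return energies, confidence, prediction

# Two-level hierarchy sketch: person-level LSTM features feed a group-level LSTM.
person_lstm = nn.LSTM(input_size=128, hidden_size=256, batch_first=True)
group_lstm = nn.LSTM(input_size=256, hidden_size=256, batch_first=True)
energy_layer = EnergyLayer(hidden_dim=256, num_classes=8)

x = torch.randn(4, 10, 128)                     # (persons, time, features), toy input
person_feats, _ = person_lstm(x)                # per-person temporal features
pooled = person_feats.mean(dim=0, keepdim=True) # pool persons into one group track
group_feats, _ = group_lstm(pooled)
energies, confidence, pred = energy_layer(group_feats[:, -1])
print(pred, confidence.max().item())
```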
Learning Social Affordance Grammar from Videos: Transferring Human Interactions to Human-Robot Interactions
In this paper, we present a general framework for learning social affordance
grammar as a spatiotemporal AND-OR graph (ST-AOG) from RGB-D videos of human
interactions, and transfer the grammar to humanoids to enable a real-time
motion inference for human-robot interaction (HRI). Based on Gibbs sampling,
our weakly supervised grammar learning can automatically construct a
hierarchical representation of an interaction with long-term joint sub-tasks of
both agents and short-term atomic actions of individual agents. Based on a new
RGB-D video dataset with rich instances of human interactions, our experiments
with Baxter simulation, human evaluation, and a real Baxter test demonstrate that
the model learned from limited training data successfully generates human-like
behaviors in unseen scenarios and outperforms both baselines.
Comment: The 2017 IEEE International Conference on Robotics and Automation (ICRA)
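As a rough illustration of how Gibbs sampling can assign latent sub-task labels to segments of an interaction video, here is a self-contained toy sketch. The feature-based scoring, segment counts, and label set are hypothetical and far simpler than the weakly supervised ST-AOG learning described above.

```python
import numpy as np

rng = np.random.default_rng(0)
num_segments, num_subtasks = 20, 4
features = rng.normal(size=(num_segments, 8))            # toy per-segment features
labels = rng.integers(num_subtasks, size=num_segments)   # latent sub-task labels

for sweep in range(50):                 # Gibbs sweeps over all segments
    for i in range(num_segments):
        others = np.arange(num_segments) != i
        scores = []
        for k in range(num_subtasks):
            # Hypothetical score: closeness of segment i to the mean feature of
            # the segments currently assigned to sub-task k (excluding i itself).
            members = features[others & (labels == k)]
            center = members.mean(axis=0) if len(members) else np.zeros(8)
            scores.append(-np.sum((features[i] - center) ** 2))
        scores = np.array(scores)
        probs = np.exp(scores - scores.max())   # conditional distribution over sub-tasks
        probs /= probs.sum()
        labels[i] = rng.choice(num_subtasks, p=probs)     # resample one label

print(labels)                           # final sub-task assignment per segment
```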
Discovering Generalizable Spatial Goal Representations via Graph-based Active Reward Learning
In this work, we consider one-shot imitation learning for object
rearrangement tasks, where an AI agent needs to watch a single expert
demonstration and learn to perform the same task in different environments. To
achieve strong generalization, the AI agent must infer the spatial goal
specification for the task. However, there can be multiple goal specifications
that fit the given demonstration. To address this, we propose a reward learning
approach, Graph-based Equivalence Mappings (GEM), that can discover spatial
goal representations that are aligned with the intended goal specification,
enabling successful generalization in unseen environments. Specifically, GEM
represents a spatial goal specification by a reward function conditioned on i)
a graph indicating important spatial relationships between objects and ii)
state equivalence mappings for each edge in the graph indicating invariant
properties of the corresponding relationship. GEM combines inverse
reinforcement learning and active reward learning to efficiently improve the
reward function by utilizing the graph structure and domain randomization
enabled by the equivalence mappings. We conducted experiments with simulated
oracles and with human subjects. The results show that GEM can drastically
improve the generalizability of the learned goal representations over strong
baselines.
Comment: ICML 2022, the first two authors contributed equally, project page https://www.tshu.io/GE
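To illustrate the idea of a reward conditioned on a relationship graph with per-edge equivalence mappings, here is a small hypothetical sketch. The object layout, edges, and mappings are invented for illustration and are not GEM's actual reward model.

```python
import numpy as np

def relative_pose(state, a, b):
    # 2D offset between two objects in a state (toy representation).
    return state[b] - state[a]

def reward(state, goal_state, edges):
    # Sum, over graph edges, of how closely the (mapped) relative pose in the
    # current state matches the (mapped) relative pose in the goal state.
    total = 0.0
    for (a, b, equivalence_map) in edges:
        rel = equivalence_map(relative_pose(state, a, b))
        goal_rel = equivalence_map(relative_pose(goal_state, a, b))
        total -= np.linalg.norm(rel - goal_rel)   # penalize deviation from goal relation
    return total

# Example graph: for edge (0, 1) only distance matters, so the equivalence
# mapping collapses all rotations of the offset; for edge (1, 2) the full
# relative offset matters (identity mapping).
edges = [
    (0, 1, lambda rel: np.array([np.linalg.norm(rel)])),  # rotation-invariant
    (1, 2, lambda rel: rel),                               # identity mapping
]

goal = {0: np.array([0.0, 0.0]), 1: np.array([1.0, 0.0]), 2: np.array([1.0, 1.0])}
state = {0: np.array([0.0, 0.0]), 1: np.array([0.0, 1.0]), 2: np.array([0.0, 2.0])}
print(reward(state, goal, edges))       # 0.0: both mapped relations match the goal
```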